Many colleges want to optimize the money they receive from their alumni. In order to do so, they need to identify and predict the salary and unemployment rate of recent graduates based on their education and various other factors. With those predictions, colleges can put more money into the programs that produce a larger return on their investments (their students).
Business Question:
Where can colleges put money in order to optimize the amount of money they receive from recent graduates?
Analysis Question:
Based on recent graduates and their characteristics/education, what would be their predicted median salary? Would they make more or less than $50,000?
This data is pulled from the 2010-12 American Community Survey Public Use Microdata Series and is limited to respondents under the age of 28. The general purpose of this code and data is based upon this story, which describes the dilemma college students face when choosing the right major, weighing the financial benefits of the field against the likelihood of graduating. It breaks down overarching major groups like "Engineering" and "STEM," and dives deeper into what each major means in terms of later financial stability and its popularity in comparison to other majors. The actual dataset contains a detailed breakdown of basic earnings as well as labor force information, taking into account sex and the type of job acquired post-graduation.
A brief look at the raw data can be found below.
## 'data.frame': 172 obs. of 21 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Major_code : int 2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
## $ Major : chr "PETROLEUM ENGINEERING" "MINING AND MINERAL ENGINEERING" "METALLURGICAL ENGINEERING" "NAVAL ARCHITECTURE AND MARINE ENGINEERING" ...
## $ Total : int 2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
## $ Men : int 2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
## $ Women : int 282 77 131 135 11021 373 1667 960 10907 16016 ...
## $ Major_category : chr "Engineering" "Engineering" "Engineering" "Engineering" ...
## $ ShareWomen : num 0.121 0.102 0.153 0.107 0.342 ...
## $ Sample_size : int 36 7 3 16 289 17 51 10 1029 631 ...
## $ Employed : int 1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
## $ Full_time : int 1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
## $ Part_time : int 270 170 133 150 5180 264 296 553 13101 12695 ...
## $ Full_time_year_round: int 1207 388 340 692 16697 1449 2482 827 54639 41413 ...
## $ Unemployed : int 37 85 16 40 1672 400 308 33 4650 3895 ...
## $ Unemployment_rate : num 0.0184 0.1172 0.0241 0.0501 0.0611 ...
## $ Median : int 110000 75000 73000 70000 65000 65000 62000 62000 60000 60000 ...
## $ P25th : int 95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
## $ P75th : int 125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
## $ College_jobs : int 1534 350 456 529 18314 1142 1768 972 52844 45829 ...
## $ Non_college_jobs : int 364 257 176 102 4440 657 314 500 16384 10874 ...
## $ Low_wage_jobs : int 193 50 0 0 972 244 259 220 3253 3170 ...
## - attr(*, "na.action")= 'omit' Named int 22
## ..- attr(*, "names")= chr "22"
As can be seen above, many of the variables are integer-valued. Several of them can be converted into factor variables in addition to the numerical ones. Additionally, the variables Rank, Major_code, and Major can be dropped: Rank correlates strongly with the salary variable, and the other two are too specific to generalize from.
# Add two categorical targets, then drop Rank, Major_code, and Major.
# Note: the 0.5 cutoff means 50% unemployment, so in practice every major
# falls into the "Low" category.
majors_added_categorical <- majors_raw %>%
  mutate(Over.50K = ifelse(Median > 50000, "Over", "Under.Equal"),
         High.Unemployment = ifelse(Unemployment_rate > 0.5, "High", "Low")) %>%
  select(-1, -2, -3)
In addition, the major-category labels can be compressed into fewer buckets to make the variable more useful for the analysis.
##
## Sciences Arts Other STEM
## 54 30 48 40
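A four-bucket table like the one above could be produced with a mapping such as the following sketch; the specific bucket assignments here are illustrative assumptions, not necessarily the ones used in the report.

```r
library(dplyr)

# Hypothetical mapping from the original major categories to the four
# buckets used in the analysis (assignments are assumptions for illustration).
compress_category <- function(cat) {
  case_when(
    cat %in% c("Engineering", "Computers & Mathematics",
               "Physical Sciences") ~ "STEM",
    cat %in% c("Biology & Life Science", "Health",
               "Agriculture & Natural Resources") ~ "Sciences",
    cat %in% c("Arts", "Humanities & Liberal Arts") ~ "Arts",
    TRUE ~ "Other"
  )
}

# Toy demonstration on a few category labels
compress_category(c("Engineering", "Arts", "Business", "Health"))
```

In the report, a function like this would feed a `mutate()` on the majors data, followed by `table()` to produce the counts shown above.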
In order to do some analysis, all categorical variables need to be one-hot encoded, which is done below:
# One Hot Encoded Data
majors_onehot <- one_hot(data.table(majors_factors), cols = c("Major_category", "High.Unemployment"))
# Normal Data
majors <- majors_factors
Before beginning the analytical part of the exploration, it is beneficial to visualize and summarize the data in order to better understand it in its entirety, with an emphasis on the variables believed to be important for the analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 33000 36000 40077 45000 110000
## Total Men Women ShareWomen Sample_size Employed Full_time
## Total 1.0000000 0.8780884 0.9447645 0.1429993 0.9455747 0.9962140 0.9893392
## Men 0.8780884 1.0000000 0.6727589 -0.1120136 0.8751756 0.8706047 0.8935631
## Women 0.9447645 0.6727589 1.0000000 0.2978321 0.8626064 0.9440365 0.9176812
## Part_time Full_time_year_round Unemployed Unemployment_rate Median
## Total 0.9502684 0.9811118 0.9747684 0.08319170 -0.1067377
## Men 0.7515917 0.8924540 0.8694115 0.10150234 0.0259906
## Women 0.9545133 0.9057195 0.9116943 0.05910776 -0.1828419
## P25th P75th College_jobs Non_college_jobs Low_wage_jobs
## Total -0.07192608 -0.08319767 0.8004648 0.9412471 0.9355096
## Men 0.03872518 0.05239290 0.5631684 0.8514998 0.7913360
## Women -0.13773826 -0.16452834 0.8519460 0.8721318 0.9044699
The matrix above (a correlation matrix, not a confusion matrix) details the correlation coefficients between all of the variables and "Total," "Men," and "Women." The correlation coefficient is a measure of the strength of the relationship between two variables; a magnitude close to 1 or -1 indicates a strong direct or inverse relationship, respectively. Based on the output, it is important to note the differences in "Employed" between men and women: the number of women in a major correlates more strongly with employment (~0.944) than the number of men does (~0.871). Similarly, women's counts correlate more strongly with part-time work (~0.955) than men's (~0.752). On the other hand, for the Median variable, which describes the median earnings of full-time year-round workers, women show a slight inverse relationship (~ -0.183) whereas men show a slight direct relationship (~0.026). This is an important dissimilarity: majors with more women are slightly more associated with employment, yet are not paid as much in comparison.
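A correlation matrix like the one above can be computed with base R's `cor()` on the numeric columns. The sketch below uses the first four rows of the raw data shown earlier as a toy stand-in rather than the full dataset.

```r
# Toy stand-in built from the first four rows of the raw data shown above
toy <- data.frame(
  Total  = c(2339, 756, 856, 1258),
  Men    = c(2057, 679, 725, 1123),
  Women  = c(282, 77, 131, 135),
  Median = c(110000, 75000, 73000, 70000)
)

# Correlations of every numeric column against Total, Men, and Women
round(cor(toy)[c("Total", "Men", "Women"), ], 4)
```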
Now, we can visualize the dataset. To do this, we used the ggplot2 and plotly packages.
As can be seen above, the first graph we created is a polar graph. A polar graph allows the reader to understand the sampling distribution, as well as the amount of representation each major category has in the dataset. The larger the slice, the more representation the category has in the dataset. From the polar chart, Sciences has the largest amount of representation, followed closely by the Other category. STEM is third, but by a large margin, and Arts is last.
The next graph we created was a stacked bar graph. The major category is on the x-axis, while the count - normalized to be between 0 and 1 - is on the y-axis. The fill of the graph is based on whether or not a person from that category has a median salary that is larger than $50,000. From this graph, it seems that STEM majors have almost 50 percent of their category making above 50K per year - the largest percentage of the four major categories. The other three major categories are nowhere close to STEM, with the Other category coming in second with about 7 percent of their category making above 50K. Science is third with what seems to be about 1 percent of their category making above 50K, and Art is last with what seems to be 0 percent of their category making above 50K per year.
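The normalized stacked bar described above can be sketched in ggplot2 with `geom_bar(position = "fill")`; the toy data frame below stands in for the real majors data.

```r
library(ggplot2)

# Toy stand-in with the same column names as the processed majors data
toy <- data.frame(
  Major_category = c("STEM", "STEM", "Arts", "Other", "Sciences", "Other"),
  Over.50K = c("Over", "Under.Equal", "Under.Equal", "Under.Equal",
               "Under.Equal", "Over")
)

# position = "fill" normalizes each bar so the y-axis runs from 0 to 1
p <- ggplot(toy, aes(x = Major_category, fill = Over.50K)) +
  geom_bar(position = "fill") +
  labs(x = "Major category", y = "Proportion")
```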
For our third graph, we created a box plot where the x-axis is the median salary and the y-axis lists the four major categories. From this graph, it can be deduced that the spread of STEM median salaries is wider than that of any other category: the STEM range appears to be about 40-50K, whereas the other categories have ranges of at most about 30K. One STEM major has a median salary of 110K, which is well above the highest median salary of any other major category. Another interesting aspect of the STEM box plot, compared to the other three, is that the 25th percentile of STEM median salaries is about 45K, which is higher than the 75th percentile for any other category. The other three box plots are relatively similar to each other, with the Arts category being much narrower than the other two; the narrower the box, the smaller the range of salaries.
Our final graph above is a three-dimensional scatterplot. The unemployment rate on a scale of 0-1 is on the x-axis, the share of women as a decimal is on the y-axis, and the z-axis shows the number of low-wage jobs held by graduates of each major. The marker color depends on the major's median salary, using a gradient color scheme. From the graph, it appears that majors with a larger share of women tend toward lower pay, more low-wage jobs, and higher unemployment rates. Another interesting thing to note is that only one major overall has a median salary above 100K.
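A 3-D scatterplot of this kind can be sketched with `plotly::plot_ly()`; the column names follow the dataset, but the small frame below is a stand-in for illustration.

```r
library(plotly)

# Toy stand-in using values from the first rows of the raw data
toy <- data.frame(
  Unemployment_rate = c(0.0184, 0.1172, 0.0241),
  ShareWomen        = c(0.121, 0.102, 0.153),
  Low_wage_jobs     = c(193, 50, 0),
  Median            = c(110000, 75000, 73000)
)

# Gradient marker color driven by each major's median salary
fig <- plot_ly(toy,
               x = ~Unemployment_rate, y = ~ShareWomen, z = ~Low_wage_jobs,
               color = ~Median,
               type = "scatter3d", mode = "markers")
```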
## [1] 172 22
## [1] 121 22
## [1] 26 22
## [1] 25 22
## Classes 'data.table' and 'data.frame': 121 obs. of 21 variables:
## $ Total : int 2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
## $ Men : int 2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
## $ Women : int 282 77 131 135 11021 373 1667 960 10907 16016 ...
## $ Major_category_Sciences: int 0 0 0 0 0 0 0 1 0 0 ...
## $ Major_category_Arts : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Major_category_Other : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Major_category_STEM : int 1 1 1 1 1 1 0 0 1 1 ...
## $ ShareWomen : num 0.121 0.102 0.153 0.107 0.342 ...
## $ Sample_size : int 36 7 3 16 289 17 51 10 1029 631 ...
## $ Employed : int 1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
## $ Full_time : int 1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
## $ Part_time : int 270 170 133 150 5180 264 296 553 13101 12695 ...
## $ Full_time_year_round : int 1207 388 340 692 16697 1449 2482 827 54639 41413 ...
## $ Unemployed : int 37 85 16 40 1672 400 308 33 4650 3895 ...
## $ Unemployment_rate : num 0.0184 0.1172 0.0241 0.0501 0.0611 ...
## $ P25th : int 95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
## $ P75th : int 125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
## $ College_jobs : int 1534 350 456 529 18314 1142 1768 972 52844 45829 ...
## $ Non_college_jobs : int 364 257 176 102 4440 657 314 500 16384 10874 ...
## $ Low_wage_jobs : int 193 50 0 0 972 244 259 220 3253 3170 ...
## $ High.Unemployment_Low : int 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
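The C5.0 model printed below was presumably fit with caret's `train()` using 10-fold cross-validation repeated 5 times. The sketch substitutes a two-class subset of `iris` so it runs on its own; the report would use the one-hot encoded majors data with `Over.50K` as the target.

```r
library(caret)  # also requires the C50 package for method = "C5.0"

set.seed(1)

# Two-class stand-in data; the report's target was Over.50K
toy <- iris[iris$Species != "setosa", ]
toy$Species <- droplevels(toy$Species)

# 10-fold CV repeated 5 times, matching the resampling shown below
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
mdl <- train(Species ~ ., data = toy, method = "C5.0", trControl = ctrl)
```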
median_mdl
## C5.0
##
## 121 samples
## 21 predictor
## 2 classes: 'Over', 'Under.Equal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.9103030 0.5955467
## rules FALSE 10 0.9159091 0.6193462
## rules FALSE 20 0.9209324 0.6407262
## rules TRUE 1 0.9150583 0.5344605
## rules TRUE 10 0.9079604 0.5216761
## rules TRUE 20 0.9144988 0.5592476
## tree FALSE 1 0.9127040 0.6083058
## tree FALSE 10 0.9193939 0.6468462
## tree FALSE 20 0.9243939 0.6600929
## tree TRUE 1 0.9117249 0.5339766
## tree TRUE 10 0.9146503 0.5458891
## tree TRUE 20 0.9163170 0.5668338
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
## = FALSE.
Based on the accuracies and the respective kappa values for each number of trials, both with and without winnowing, the final model was chosen with 20 trials, a tree model, and no winnowing. The accuracy for winnowing with 20 trials was approximately 0.9163, while the accuracy for no winnowing with 20 trials was approximately 0.9244. Although the gap in accuracy is modest, no winnowing consistently produced higher accuracy and substantially higher kappa; therefore, no winnowing with 20 boosting trials is the most suitable option for constructing this model.
# plot the model
plot(median_mdl)
Graphically, the difference between no winnowing (FALSE) and winnowing (TRUE) across the number of trials can be visualized. As seen in the graph, accuracy generally trends upward as the number of trials increases, and the FALSE curve sits higher in accuracy than the TRUE curve. This visualization supports the previous output, which identified 20 trials and no winnowing as the most favorable final model.
## Confusion Matrix and Statistics
##
## Actual
## Prediction Over Under.Equal
## Over 2 1
## Under.Equal 2 21
##
## Accuracy : 0.8846
## 95% CI : (0.6985, 0.9755)
## No Information Rate : 0.8462
## P-Value [Acc > NIR] : 0.417
##
## Kappa : 0.5063
##
## Mcnemar's Test P-Value : 1.000
##
## Sensitivity : 0.50000
## Specificity : 0.95455
## Pos Pred Value : 0.66667
## Neg Pred Value : 0.91304
## Precision : 0.66667
## Recall : 0.50000
## F1 : 0.57143
## Prevalence : 0.15385
## Detection Rate : 0.07692
## Detection Prevalence : 0.11538
## Balanced Accuracy : 0.72727
##
## 'Positive' Class : Over
##
From the generated confusion matrix, the most useful metrics for analysis are the accuracy coupled with the F1 score. The goal is for accuracy to be as close to 1 as possible; with that in mind, the value of 0.8846 is reasonably good, though there is still room to push it closer to 1. The F1 score is the harmonic mean of precision and recall. The value of 0.5714 indicates that the model struggles with the minority "Over" class, and like the accuracy, it should be improved to get as close to 1 as possible.
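The headline numbers above can be recomputed by hand from the 2x2 table, treating "Over" as the positive class:

```r
# Cell counts from the confusion matrix above
TP <- 2   # predicted Over,        actually Over
FP <- 1   # predicted Over,        actually Under.Equal
FN <- 2   # predicted Under.Equal, actually Over
TN <- 21  # predicted Under.Equal, actually Under.Equal

accuracy  <- (TP + TN) / (TP + FP + FN + TN)  # 23/26 ~ 0.8846
precision <- TP / (TP + FP)                   # 2/3   ~ 0.6667
recall    <- TP / (TP + FN)                   # 2/4   = 0.5000
f1        <- 2 * precision * recall / (precision + recall)  # 4/7 ~ 0.5714
```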
## C5.0 variable importance
##
## only 20 most important variables shown (out of 21)
##
## Overall
## Men 100.00
## Major_category_Other 100.00
## P75th 100.00
## P25th 100.00
## Major_category_STEM 99.17
## Unemployment_rate 97.52
## Low_wage_jobs 66.12
## ShareWomen 41.32
## Women 30.58
## Sample_size 29.75
## Major_category_Sciences 24.79
## Non_college_jobs 20.66
## Total 10.74
## Part_time 0.00
## Full_time_year_round 0.00
## Full_time 0.00
## Unemployed 0.00
## High.Unemployment_Low 0.00
## Major_category_Arts 0.00
## Employed 0.00
## C5.0
##
## 121 samples
## 21 predictor
## 2 classes: 'Over', 'Under.Equal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 20 0.9209324 0.6407262
## rules FALSE 30 0.9145221 0.6129751
## rules FALSE 40 0.9143939 0.6021384
## rules TRUE 20 0.9144988 0.5592476
## rules TRUE 30 0.9144988 0.5592476
## rules TRUE 40 0.9144988 0.5592476
## tree FALSE 20 0.9243939 0.6600929
## tree FALSE 30 0.9210606 0.6490401
## tree FALSE 40 0.9224709 0.6456319
## tree TRUE 20 0.9179837 0.5743338
## tree TRUE 30 0.9179837 0.5743338
## tree TRUE 40 0.9179837 0.5743338
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
## = FALSE.
## C5.0
##
## 121 samples
## 21 predictor
## 2 classes: 'Over', 'Under.Equal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 109, 109, 108, 110, 109, 110, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.9103030 0.5955467
## rules FALSE 10 0.9159091 0.6193462
## rules FALSE 20 0.9209324 0.6407262
## rules TRUE 1 0.9150583 0.5344605
## rules TRUE 10 0.9079604 0.5216761
## rules TRUE 20 0.9144988 0.5592476
## tree FALSE 1 0.9127040 0.6083058
## tree FALSE 10 0.9193939 0.6468462
## tree FALSE 20 0.9243939 0.6600929
## tree TRUE 1 0.9117249 0.5339766
## tree TRUE 10 0.9146503 0.5458891
## tree TRUE 20 0.9163170 0.5668338
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = tree and winnow
## = FALSE.
## Confusion Matrix and Statistics
##
## Actual
## Prediction Over Under.Equal
## Over 2 1
## Under.Equal 2 21
##
## Accuracy : 0.8846
## 95% CI : (0.6985, 0.9755)
## No Information Rate : 0.8462
## P-Value [Acc > NIR] : 0.417
##
## Kappa : 0.5063
##
## Mcnemar's Test P-Value : 1.000
##
## Sensitivity : 0.50000
## Specificity : 0.95455
## Pos Pred Value : 0.66667
## Neg Pred Value : 0.91304
## Prevalence : 0.15385
## Detection Rate : 0.07692
## Detection Prevalence : 0.11538
## Balanced Accuracy : 0.72727
##
## 'Positive' Class : Over
##
## Confusion Matrix and Statistics
##
## Actual
## Prediction Over Under.Equal
## Over 3 0
## Under.Equal 0 22
##
## Accuracy : 1
## 95% CI : (0.8628, 1)
## No Information Rate : 0.88
## P-Value [Acc > NIR] : 0.04093
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Prevalence : 0.12
## Detection Rate : 0.12
## Detection Prevalence : 0.12
## Balanced Accuracy : 1.00
##
## 'Positive' Class : Over
##
## [1] 0.3953488
##
## LE.EQ.20K G.50K
## 104 68
## [1] 121 21
## [1] 25 21
## [1] 26 21
## [1] 4.472136
## X1.nrow.combined_RF.err.rate. OOB LE.EQ.20K G.50K
## 1 1 0.2982456 0.3666667 0.2222222
## 2 2 0.2325581 0.2549020 0.2000000
## 3 3 0.2475248 0.2372881 0.2619048
## 4 4 0.2110092 0.1904762 0.2391304
## 5 5 0.2280702 0.1617647 0.3260870
## 6 6 0.2288136 0.1830986 0.2978723
## 7 7 0.2000000 0.1506849 0.2765957
## 8 8 0.2250000 0.1917808 0.2765957
## 9 9 0.1735537 0.1095890 0.2708333
## 10 10 0.1900826 0.1232877 0.2916667
## 'data.frame': 121 obs. of 21 variables:
## $ Total : int 2339 756 1258 32260 3777 1792 91227 81527 15058 14955 ...
## $ Men : int 2057 679 1123 21239 2110 832 80320 65511 12953 8407 ...
## $ Women : int 282 77 135 11021 1667 960 10907 16016 2105 6548 ...
## $ Major_category : Factor w/ 4 levels "Sciences","Arts",..: 4 4 4 4 3 1 4 4 4 4 ...
## $ ShareWomen : num 0.121 0.102 0.107 0.342 0.441 ...
## $ Sample_size : int 36 7 16 289 51 10 1029 631 147 79 ...
## $ Employed : int 1976 640 758 25694 2912 1526 76442 61928 11391 10047 ...
## $ Full_time : int 1849 556 1069 23170 2924 1085 71298 55450 11106 9017 ...
## $ Part_time : int 270 170 150 5180 296 553 13101 12695 2724 2694 ...
## $ Full_time_year_round: int 1207 388 692 16697 2482 827 54639 41413 8790 5986 ...
## $ Unemployed : int 37 85 40 1672 308 33 4650 3895 794 1019 ...
## $ Unemployment_rate : num 0.0184 0.1172 0.0501 0.0611 0.0957 ...
## $ Median : int 110000 75000 70000 65000 62000 62000 60000 60000 60000 60000 ...
## $ P25th : int 95000 55000 43000 50000 53000 31500 48000 45000 42000 36000 ...
## $ P75th : int 125000 90000 80000 75000 72000 109000 70000 72000 70000 70000 ...
## $ College_jobs : int 1534 350 529 18314 1768 972 52844 45829 8184 6439 ...
## $ Non_college_jobs : int 364 257 102 4440 314 500 16384 10874 2425 2471 ...
## $ Low_wage_jobs : int 193 50 0 972 259 220 3253 3170 372 789 ...
## $ Over.50K : Factor w/ 2 levels "Over","Under.Equal": 1 1 1 1 1 1 1 1 1 1 ...
## $ High.Unemployment : Factor w/ 1 level "Low": 1 1 1 1 1 1 1 1 1 1 ...
## $ combined_target : Factor w/ 2 levels "LE.EQ.20K","G.50K": 1 1 1 2 2 2 1 1 1 2 ...
## mtry = 4 OOB error = 20.66%
## Searching left ...
## mtry = 2 OOB error = 19.01%
## 0.08 0.05
## mtry = 1 OOB error = 29.75%
## -0.5652174 0.05
## Searching right ...
## mtry = 8 OOB error = 14.88%
## 0.2173913 0.05
## mtry = 16 OOB error = 9.09%
## 0.3888889 0.05
## mtry = 20 OOB error = 11.57%
## -0.2727273 0.05
##
## Call:
## randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 16
##
## OOB estimate of error rate: 12.4%
## Confusion matrix:
## LE.EQ.20K G.50K class.error
## LE.EQ.20K 65 8 0.1095890
## G.50K 7 41 0.1458333
Because the built-in random forest model did not play well with the tuning done through the caret library, a custom random forest classification tuning routine was created in order to determine the best values for the three hyperparameters identified above.
Now, we can set the hyperparameter values to try and tune the model.
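The grid printed below can be built with `expand.grid()`, which varies its first argument fastest; the parameter values are taken from the output itself.

```r
# All 27 combinations of the three random forest hyperparameters
rf_grid <- expand.grid(.mtry     = 3:5,
                       .sampsize = c(50, 100, 200),
                       .ntree    = c(200, 300, 400))
nrow(rf_grid)  # 27
```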
## .mtry .sampsize .ntree
## 1 3 50 200
## 2 4 50 200
## 3 5 50 200
## 4 3 100 200
## 5 4 100 200
## 6 5 100 200
## 7 3 200 200
## 8 4 200 200
## 9 5 200 200
## 10 3 50 300
## 11 4 50 300
## 12 5 50 300
## 13 3 100 300
## 14 4 100 300
## 15 5 100 300
## 16 3 200 300
## 17 4 200 300
## 18 5 200 300
## 19 3 50 400
## 20 4 50 400
## 21 5 50 400
## 22 3 100 400
## 23 4 100 400
## 24 5 100 400
## 25 3 200 400
## 26 4 200 400
## 27 5 200 400
## 121 samples
## 19 predictor
## 2 classes: 'Over', 'Under.Equal'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 97, 97, 97, 96, 97, 97, ...
## Resampling results across tuning parameters:
##
## mtry sampsize ntree ROC Sens Spec
## 3 50 200 0.9903810 0.8533333 1.0000000
## 3 50 300 0.9910159 0.8300000 1.0000000
## 3 50 400 0.9910159 0.8266667 1.0000000
## 3 100 200 0.9913333 0.8500000 1.0000000
## 3 100 300 0.9871429 0.8266667 1.0000000
## 3 100 400 0.9910159 0.8300000 1.0000000
## 3 200 200 0.9910159 0.8300000 1.0000000
## 3 200 300 0.9897460 0.8400000 1.0000000
## 3 200 400 0.9903492 0.8400000 0.9980952
## 4 50 200 0.9913333 0.8633333 1.0000000
## 4 50 300 0.9916508 0.8666667 1.0000000
## 4 50 400 0.9910159 0.8533333 1.0000000
## 4 100 200 0.9909841 0.9000000 1.0000000
## 4 100 300 0.9897143 0.8666667 1.0000000
## 4 100 400 0.9916508 0.8766667 1.0000000
## 4 200 200 0.9906984 0.8766667 1.0000000
## 4 200 300 0.9925873 0.8666667 1.0000000
## 4 200 400 0.9929206 0.8533333 1.0000000
## 5 50 200 0.9916508 0.9233333 1.0000000
## 5 50 300 0.9910159 0.8966667 1.0000000
## 5 50 400 0.9922857 0.8966667 1.0000000
## 5 100 200 0.9903810 0.8966667 1.0000000
## 5 100 300 0.9916508 0.8866667 1.0000000
## 5 100 400 0.9916508 0.9100000 1.0000000
## 5 200 200 0.9922857 0.8866667 1.0000000
## 5 200 300 0.9910159 0.8866667 1.0000000
## 5 200 400 0.9916508 0.8633333 1.0000000
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, ntree = 400 and sampsize
## = 200.
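With the tuning finished, the selected hyperparameters (mtry = 4, ntree = 400, sampsize = 200) would presumably be used to refit a final forest. The sketch below uses `iris` as a runnable stand-in, with the sample size reduced to fit that data; the report would refit on the majors training set.

```r
library(randomForest)

set.seed(1)

# Stand-in refit: iris has 4 predictors, so mtry = 4 uses all of them;
# sampsize is capped at 100 because iris has only 150 rows.
final_rf <- randomForest(Species ~ ., data = iris,
                         mtry = 4, ntree = 400, sampsize = 100)
```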
# Evaluation of Model
I believe that our model is fair and accounts for the singular protected class present in our dataset: women. Our dataset has a variable, ShareWomen, that records the percentage of each major's workforce held by women. Because of this, the models that we create can help indicate whether women are being treated fairly in the workplace. For example, our linear regression model suggests that women are being paid an unfairly low median salary relative to the share of the workforce they hold. If our dataset did not have the ShareWomen variable, our model would not be able to assess whether women are being paid fairly.

### Conclusion
One additional piece of analysis that would benefit the report as a whole is using more recently recorded data. The data used in this analysis was recorded from 2010-2012, so the trends discovered in our analysis are likely outdated. Newer data would greatly benefit a university requesting this report, as it could adjust funding for major categories based on current trends rather than older ones. Another addition that would benefit our report is a standalone decision tree model. Our analysis included linear regression and the random forest model, but never a single decision tree, which would have shown a model where the locally optimal split is chosen at every node, since decision tree induction is a greedy algorithm by nature. Including a decision tree would have made our analysis more diverse and well-rounded, as we would have performed the analysis using three different major analytic methods. Beyond that, we do not believe anything substantially limited our analysis: the dataset was easy to work with, and the models we created learned the data efficiently and effectively.